COVID-19 Project
2 How did the research you gathered contribute to your question development?
Our group wanted to look further into COVID-19 data because it is a timely and relevant topic as we live through this new normal. Before searching for available data sets, we brainstormed possible COVID-19 topics to delve into, including hospital utilization during COVID-19, policy, and testing. We each found data sets related to the topics we had discussed and chose the one that best fit the requirements for our project.
After reviewing our chosen data set, and knowing that several factors could lead to an increase in cases, we settled on our final SMART question: “What are factors that lead to an increase in COVID-19 cases and death rates?” After learning more about the virus, we understood that risk factors such as health conditions, race, age, and adherence to policies could affect the total number of COVID-19 cases. Our question also came from pure curiosity: we already had some knowledge of the virus and wanted to answer the SMART question ourselves through our own data analysis. With the information we had, we came up with sub-questions, discussed later, to help settle the final SMART question and to make sure it could be answered with the variables in our data set.
3 What additional information would have been beneficial?
Additional information in our data set would have aided our analysis. For example, it would have been helpful to have testing information for each city rather than by state, as in the data set; that variable was not consistent with the level at which the rest of the data was collected. It also would have been beneficial to have ongoing data rather than data ending in April. Because scientists are finding new information on COVID-19 as time goes on, more recent data could have made our results more relevant and might have led to different conclusions.
4 How did your question change, if at all, after EDA?
In the beginning we asked a few questions, such as “Which race is the majority of the sample?” and “Are patients from a certain race?” During EDA we deleted the latter, because this is an overall study of the COVID-19 epidemic in different regions of the United States, not a study of individuals: we cannot determine the race of each confirmed case.
We also deleted “Which race has the highest average death rate and total cases?” for the same reason: we cannot determine the situation of each individual, so we cannot compute statistics for this question. We can only observe the correlation coefficients between total cases, deaths, and the proportion of each race, based on the correlation coefficient graph. Therefore, we changed the question to “The proportion of which race is related to the number of confirmed cases and deaths?”
We added a few more questions, such as “Are the total cases related to age/gender/poverty?” We first divided total cases into four levels and then found that the average values of these variables are significantly different across levels, so we concluded that they are related to total cases.
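The four-level comparison can be sketched as follows. This is a minimal Python illustration, not the R code behind the report (which is not shown). The report does not say how the levels were cut, so quartiles are assumed here; the function names `assign_levels` and `mean_by_level` and the toy numbers are hypothetical.

```python
from statistics import mean, quantiles

def assign_levels(values):
    """Label each value 1-4 by the quartile of `values` it falls in (assumed cut rule)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")

    def level(v):
        if v <= q1:
            return 1
        if v <= q2:
            return 2
        if v <= q3:
            return 3
        return 4

    return [level(v) for v in values]

def mean_by_level(levels, covariate):
    """Average a covariate (e.g. a poverty rate) within each case level."""
    groups = {}
    for lvl, x in zip(levels, covariate):
        groups.setdefault(lvl, []).append(x)
    return {lvl: mean(xs) for lvl, xs in sorted(groups.items())}

# Hypothetical toy data, not taken from the project data set.
total_cases = [0, 1, 2, 5, 9, 20, 39, 150, 2000, 110465]
poverty = [22.0, 20.5, 19.0, 18.0, 16.5, 15.0, 14.0, 12.5, 11.0, 10.0]

levels = assign_levels(total_cases)
poverty_by_level = mean_by_level(levels, poverty)
```

Comparing the entries of `poverty_by_level` shows whether the group means move with the case level; a formal test (for example a one-way ANOVA) would still be needed to call the differences statistically significant.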
We also set another question at the beginning: “Have there been any general trends among the health conditions?” Our analysis showed that the correlation coefficients between health variables (such as sleep status, medical history of various diseases, smoking, and obesity) and deaths are not large. Only the correlation coefficient between liver_total_death and deaths is relatively high.
We deleted the question “Are there any common underlying health conditions?” and changed it to “Does any disease relate to the death rate?”.
5 Based on EDA can you begin to sketch out an answer to your question?
5.1 United States COVID-19 Cases and Deaths by Provinces (Cities)
5.1.1 What are the top 15 Provinces based on the number of cases?
The following bar chart shows the top 15 cities by number of Covid-19 cases.
The bar chart above shows the top 15 provinces by number of cases. New York City has the highest number of COVID-19 cases, with a total of over 100,000, while every other city has fewer than 30,000.
5.1.2 What are the top 15 Provinces based on the number of deaths?
The following bar chart shows the top 15 cities by number of deaths.
The bar chart above shows the top 15 provinces by number of deaths. New York City has the highest number of deaths, around 8,000, while every other city has about 1,000 or fewer.
5.1.3 What are the top 15 States based on the number of Tests?
The bar chart above shows the top 15 states by number of tests. New York State has the highest number, with 499,143 tests, while every other state has fewer than 200,000.
5.1.4 What is the average number of cases for each State?
State total_cases
1 Alabama 59.03
2 Alaska 9.83
3 Arizona 258.60
4 Arkansas 19.44
5 California 437.43
6 Colorado 122.41
7 Connecticut 1682.00
8 Delaware 638.33
9 District of Columbia 2058.00
10 Florida 323.07
11 Georgia 85.74
12 Hawaii 101.60
13 Idaho 33.32
14 Illinois 227.48
15 Indiana 94.12
16 Iowa 19.20
17 Kansas 13.84
18 Kentucky 17.32
19 Louisiana 335.34
20 Maine 45.88
21 Maryland 394.75
22 Massachusetts 1843.87
23 Michigan 316.96
24 Minnesota 19.10
25 Mississippi 37.68
26 Missouri 40.98
27 Montana 7.18
28 Nebraska 9.48
29 Nevada 184.35
30 New Hampshire 103.50
31 New Jersey 3196.29
32 New Mexico 40.82
33 New York 3274.52
34 North Carolina 51.20
35 North Dakota 6.45
36 Ohio 82.81
37 Oklahoma 28.51
[ reached 'max' / getOption("max.print") -- omitted 14 rows ]
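A per-state average like the one in this table can be computed by grouping rows by state and averaging within each group. The following is a minimal Python sketch under assumed inputs, not the report's R code; `mean_by_state` and the toy rows are hypothetical.

```python
from collections import defaultdict

def mean_by_state(rows):
    """Average the numeric value of (state, value) pairs within each state."""
    totals = defaultdict(lambda: [0.0, 0])  # state -> [running sum, row count]
    for state, value in rows:
        totals[state][0] += value
        totals[state][1] += 1
    return {state: round(s / n, 2) for state, (s, n) in sorted(totals.items())}

# Hypothetical rows: one (state, total_cases) pair per province-level record.
rows = [("State A", 10), ("State A", 30), ("State B", 5)]
averages = mean_by_state(rows)
```

Because the mean is taken over the records within a state, a state with few records and one large outbreak can show a very high average.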
5.1.5 What is the average number of deaths for each State?
State deaths
1 Alabama 1.701
2 Alaska 0.172
3 Arizona 7.133
4 Arkansas 0.427
5 California 13.328
6 Colorado 5.109
7 Connecticut 83.375
8 Delaware 14.333
9 District of Columbia 67.000
10 Florida 7.836
11 Georgia 3.270
12 Hawaii 1.800
13 Idaho 0.750
14 Illinois 8.510
15 Indiana 4.207
16 Iowa 0.444
17 Kansas 0.657
18 Kentucky 0.900
19 Louisiana 15.922
20 Maine 1.250
21 Maryland 12.667
22 Massachusetts 49.600
23 Michigan 21.133
24 Minnesota 0.920
25 Mississippi 1.366
26 Missouri 1.284
27 Montana 0.143
28 Nebraska 0.161
29 Nevada 7.059
30 New Hampshire 0.300
31 New Jersey 133.476
32 New Mexico 0.939
33 New York 174.871
34 North Carolina 1.130
35 North Dakota 0.151
36 Ohio 3.705
37 Oklahoma 1.403
[ reached 'max' / getOption("max.print") -- omitted 14 rows ]
5.1.6 Which cities had the greatest % of population of people with poor health?
5.2 Patient Demographics
5.2.1 What are the patient demographics?
| Statistic | TC | Population | young | old | black | AIAN | Asian | NH | Hispanic | NHW | Female | Poverty | Social |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | 0 | 88 | 0.0 | 4.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6 | 2.7 | 26.8 | 3.4 | 0.0 |
| Q1 | 2 | 11034 | 20.1 | 16.3 | 0.7 | 0.4 | 0.5 | 0.0 | 2.4 | 64.7 | 49.4 | 11.4 | 8.2 |
| Median | 9 | 25758 | 22.1 | 19.0 | 2.2 | 0.6 | 0.7 | 0.1 | 4.4 | 83.5 | 50.3 | 14.8 | 11.1 |
| Mean | 191 | 105871 | 22.1 | 19.3 | 8.8 | 2.4 | 1.5 | 0.1 | 9.6 | 76.2 | 49.9 | 15.9 | 11.6 |
| Q3 | 39 | 67013 | 23.8 | 21.8 | 9.6 | 1.3 | 1.4 | 0.1 | 9.9 | 92.3 | 51.0 | 19.0 | 14.4 |
| Max | 110465 | 10105518 | 42.0 | 57.6 | 85.4 | 92.5 | 43.4 | 48.9 | 96.4 | 97.9 | 56.9 | 48.6 | 52.3 |
From the means in the output, the average proportion of people under the age of 18 is 22.1%, and the average proportion of people over 65 is 19.3%. The largest racial group is Non-Hispanic White, with an average proportion of 76.2%. The average proportion of women is 49.9%, the average proportion of people in poverty is 15.9%, and the average Social Association Rate is 11.6. We divide the data into four levels according to total cases.
5.2.2 Which race is the majority of the sample?
Using the average values, we drew a pie chart of race proportions, which shows the overall proportions of the different races.
5.3 Stay at home policy in each province
5.4 Underlying Health Conditions
5.4.1 Does any disease relate to the death rate?
The correlation matrix shows that liver_total_death has the strongest correlation with deaths among the health variables, at r = 0.4338, which is a moderate correlation.
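For reference, the Pearson correlation coefficient behind a figure like 0.4338 can be computed as below. This is a minimal Python sketch, not the report's R code; the function name `pearson_r` and the toy vectors are hypothetical, and only the value 0.4338 comes from the report.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy vectors, only to exercise the function.
r_toy = pearson_r([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])

# Squaring the reported r gives the share of variance the two variables share.
shared_variance = 0.4338 ** 2  # about 0.19
```

Read this way, r = 0.4338 means liver_total_death alone accounts for roughly 19% of the variation in deaths in a simple linear fit.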
5.5 Impact of Temperature
5.5.1 Does the temperature relate to the Total Cases or Death Rate?
tibble [3,144 x 8] (S3: tbl_df/tbl/data.frame)
$ Province : chr [1:3144] "New York City" "Nassau" "Suffolk" "Westchester" ...
$ State : chr [1:3144] "New York" "New York" "New York" "New York" ...
$ days : num [1:3144] 44 41 38 43 82 35 41 80 39 38 ...
$ total_cases : num [1:3144] 110465 25250 22691 20191 16323 ...
$ deaths : num [1:3144] 7905 1001 608 596 577 ...
$ temp_peak : num [1:3144] 8.41 7.41 6.86 5.88 2.25 ...
$ temp_before : num [1:3144] 8.33 7.78 7.1 6.68 2.55 ...
$ temp_current: num [1:3144] 9.23 8.36 7.82 7.53 3.34 ...
According to the correlation diagram, temperature is only weakly related to total_cases and deaths.
[1] 0 2 9 39 110465
6 How did you select and determine the correct model to answer your question?
6.1 Linear model
Call:
lm(formula = deaths ~ Population.Density + GDP + SHP + sleep_hour + poorhealth, data = lineardf3)
Residuals:
Min 1Q Median 3Q Max
-195.5 -6.3 -1.5 3.0 914.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.1848758118 6.8018600120 -6.05 0.00000000161 ***
Population.Density 0.0022486139 0.0006386507 3.52 0.00044 ***
GDP 0.0000006479 0.0000000319 20.29 < 0.0000000000000002 ***
SHP 1.0518973957 0.2137480942 4.92 0.00000091424 ***
sleep_hour 1.5090360066 0.2395795052 6.30 0.00000000035 ***
poorhealth -1.2298052365 0.2083461901 -5.90 0.00000000404 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.8 on 2584 degrees of freedom
Multiple R-squared: 0.238, Adjusted R-squared: 0.236
F-statistic: 161 on 5 and 2584 DF, p-value: <0.0000000000000002
Population.Density GDP SHP sleep_hour poorhealth
1.22 1.24 1.29 1.67 1.81
We used the regsubsets function with the exhaustive method to find the best model from two perspectives: BIC and adjusted R-squared. Both criteria point to the same model, which contains five variables: Population Density per Square Mile of Land, GDP (2018), % Severe Housing Problems, Sleep <7 Hours_Percent, and % Fair or Poor Health. Based on the p-values, all five variables are significant. The VIFs (all below 2) show no high degree of multicollinearity among the five variables, so they can all stay in the model. The adjusted R-squared is 0.236, indicating that the model explains 23.6% of the variation in deaths.
Final model: deaths = -41.185 + 0.002 Population.Density + 0.0000006479 GDP + 1.052 SHP + 1.509 sleep_hour - 1.230 poorhealth
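To make the final model concrete, the sketch below plugs one row into the fitted equation and rechecks the adjusted R-squared from the printed summary. It is a minimal Python illustration rather than the report's R code: the coefficients, R-squared, and degrees of freedom are copied from the lm() output above, while `predict_deaths`, `adjusted_r2`, and the input row are hypothetical.

```python
# Coefficients copied from the lm() summary in this section.
COEF = {
    "(Intercept)": -41.1848758118,
    "Population.Density": 0.0022486139,
    "GDP": 0.0000006479,
    "SHP": 1.0518973957,
    "sleep_hour": 1.5090360066,
    "poorhealth": -1.2298052365,
}
PREDICTORS = [name for name in COEF if name != "(Intercept)"]

def predict_deaths(row):
    """Fitted deaths for one row, given as a dict keyed by predictor name."""
    return COEF["(Intercept)"] + sum(COEF[name] * row[name] for name in PREDICTORS)

def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical input row; units are assumed to match the training data.
example_row = {
    "Population.Density": 1000.0,
    "GDP": 1e7,
    "SHP": 15.0,
    "sleep_hour": 35.0,
    "poorhealth": 18.0,
}
example_prediction = predict_deaths(example_row)

# n_obs = residual degrees of freedom (2584) + 6 estimated coefficients.
# The printed R-squared (0.238) is rounded, so this only approximately
# reproduces the printed adjusted value of 0.236.
adj = adjusted_r2(0.238, 2584 + 6, 5)
```

Because the intercept is about -41.2, rows with small predictor values get negative fitted deaths, which is one limitation of fitting a plain linear model to a count outcome.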
6.2 LASSO Regression
Because there are many variables, we chose Lasso regression to fit the best model. Lasso regression can shrink the coefficients of many variables to exactly 0, which effectively performs variable selection.
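The reason Lasso can set coefficients to exactly zero is the soft-thresholding step that coordinate-descent Lasso solvers apply to each coefficient. The sketch below shows that operator in Python for a single standardized predictor; it is an illustration of the general technique, not the code or package used in this report, and the input values are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalised coefficients for three standardized predictors.
unpenalised = {"strong": 0.50, "negative": -0.30, "weak": 0.05}
lam = 0.10  # hypothetical penalty, not the cross-validated value from the report

shrunk = {name: soft_threshold(z, lam) for name, z in unpenalised.items()}
kept = [name for name, b in shrunk.items() if b != 0.0]
```

With these inputs the weak predictor is dropped while the other two are kept with smaller magnitudes, which is the variable-selection behaviour described above.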
lowest lamda from CV: 0.00246
The lowest cross-validated MSE occurs at \(\lambda \approx 0.002\).
Mean MSE for best Lasso lamda: 0.203
All the coefficients :
(Intercept) population young old
-0.00301 0.26499 0.03709 0.02258
black AIAN Asian NH
0.00000 -0.00339 -0.09691 -0.00689
Hispanic NHW Female Rural
-0.00461 0.00751 -0.01883 0.02263
Population.Density
0.11744
The non-zero coefficients :
(Intercept) population young old
-0.00301 0.26499 0.03709 0.02258
AIAN Asian NH Hispanic
-0.00339 -0.09691 -0.00689 -0.00461
NHW Female Rural Population.Density
0.00751 -0.01883 0.02263 0.11744
In the LASSO fit, 11 of the 12 predictors keep non-zero coefficients; only the coefficient of black is shrunk to zero. The results suggest that race, gender, age, population, population density, and rural proportion are all associated with total cases.
We then calculate the R squared of lasso regression, which is 0.164.
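For a penalised model such as Lasso, R-squared is usually computed directly from the predictions as one minus the ratio of residual to total sum of squares. The following Python sketch shows that calculation under assumed inputs; it is not the report's R code, and `r_squared` and the toy vectors are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - SSE / SST, computed from observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    sst = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    return 1.0 - sse / sst

# Hypothetical toy values, only to exercise the function.
r2_toy = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

On this scale, the reported value of 0.164 means the Lasso predictions account for about 16% of the variation in the response.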
7 What prediction can you make with your model?
For the temperature part:
According to the correlation diagram, temperature is only weakly related to total_cases and deaths. However, it is related to days: the higher the temperature in a city, the later the disease appeared there. For prediction, we expect the second peak to arrive as the weather turns from summer to winter and temperatures fall. The colder the temperature becomes, the earlier the second peak of the spread is likely to come.
8 How reliable are your results?
# A tibble: 6 x 9
total_cases deaths sleep_hour heart_disease Low_birthweight adult_obesity
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 25250 1001 38.0 142. 7.89 23.6
2 22691 608 35.6 120. 7.74 24.6
3 20191 596 33.1 97.6 7.95 20.7
4 16323 577 33.4 95.2 8.96 28
5 12209 820 41.5 157. 10.7 34.7
6 10426 550 38.4 78.5 7.83 22.3
# ... with 3 more variables: Food_environment <dbl>, Respiratory <dbl>,
# liver_Total_death <dbl>
Data: X dimension: 1037 8
Y dimension: 1037 1
Fit method: svdpc
Number of components considered: 8
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 65.77 63.38 43.69 40.53 37.79 37.95 26.35
adjCV 65.77 63.40 43.59 40.37 37.71 38.61 26.13
7 comps 8 comps
CV 24.24 23.16
adjCV 24.14 23.09
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
X 38.227 57.98 70.52 78.37 85.00 91.23 96.28 100.00
deaths 7.447 57.73 65.81 70.13 70.94 84.80 87.19 88.69
total_cases sleep_hour heart_disease Low_birthweight
1.84 -3.86 -3.19 -4.07
adult_obesity Food_environment Respiratory liver_Total_death
-4.64 4.34 -3.95 2.05
total_cases sleep_hour heart_disease Low_birthweight
29.3934 10.0243 9.5902 0.0375
adult_obesity Food_environment Respiratory liver_Total_death
-5.4036 3.8446 -6.3601 26.2150
1 2 3 4 5 6 7 8
63.4 62.0 66.4 66.0 16.3 52.7 89.4 49.0
1 2 3 4 5
529 477 403 551 370
deaths PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
1 14.97 -4.5542 9.77 -5.52 4.688 -2.830 6.675 -2.071 -1.739
2 8.99 -4.4172 8.83 -4.70 3.523 -1.910 6.130 -2.319 -1.719
3 8.81 -4.8449 7.45 -3.25 3.167 -2.269 5.943 -1.904 -1.562
4 8.52 -4.8026 11.20 -3.72 -1.834 1.753 0.971 -3.210 -0.165
5 12.22 0.0448 7.93 -3.19 0.242 -0.374 1.186 -1.297 -0.641
6 8.11 -3.5058 4.57 -1.23 2.697 -0.725 2.278 0.186 -0.490
For the linear model relating deaths to the disease variables, I added total cases as a predictor because deaths and total cases are strongly correlated. Using principal component regression, I plotted the cross-validation curve and compared the prediction error for deaths across numbers of components. The error drops sharply once the number of components reaches two (the CV RMSEP falls from 63.38 to 43.69, and the variance in deaths explained rises from 7.4% to 57.7%), which means that PC1 and PC2 together explain much of the model. This suggests that the model is reasonably reliable.
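The choice of components can be checked directly against the cross-validated RMSEP values printed above. The sketch below is a minimal Python illustration rather than the report's R code; the numbers are copied from the CV row of the validation output, and the variable names are hypothetical.

```python
# CV RMSEP from the validation output; index = number of components
# (index 0 is the intercept-only model).
cv_rmsep = [65.77, 63.38, 43.69, 40.53, 37.79, 37.95, 26.35, 24.24, 23.16]

# Improvement gained by adding each successive component.
drops = [prev - cur for prev, cur in zip(cv_rmsep, cv_rmsep[1:])]

# Component whose addition reduces the error the most.
largest_drop_at = drops.index(max(drops)) + 1

# Number of components with the lowest cross-validated error overall.
best_k = cv_rmsep.index(min(cv_rmsep))
```

The largest single improvement comes from adding the second component, but the lowest CV RMSEP is reached with all eight components (23.16), with another large drop at six. Stopping at two components is therefore a choice for simplicity rather than the error-minimising option.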
9 What additional information or analysis might improve your model results or work to control limitations?
Random forest: tree-based methods are easy to understand and interpret. They are insensitive to outliers, so outliers do not need to be removed, and they handle missing data effectively. They have a low computational cost, require little data pre-processing, and suit large data sets with thousands of rows and missing values. However, as the trees become too large, they become harder to interpret; a single tree has high variance and low performance, and the model can easily overfit.
The data is also not good enough to predict the death rate because it is outdated: it was last updated in April, and the number of deaths has increased a lot from June until now. There is also a lot of missing data, especially for the disease variables. Only around 20 states out of 51 have a full record of patients' disease history, so after deleting rows with missing data, the data available for analysis drops from about 3,000 rows to about 1,000. In addition, COVID-19 is not immediately fatal, and a patient may fight the disease for a long time before death, so deaths lag behind cases. The COVID-19 death counts are also incomplete, since many potential patients died of COVID-19 but were counted as ordinary flu. Finally, some potentially important variables, such as blood type, were not part of the analysis.